Project
Author: Qi Meng Last modified data: 2018-02-25 ========================================================
The diamention of Proserper Loan Dataset
# length(pr)
dim(pr)
## [1] 113937 81
In total there are 81 variables that corresponding to each loan In order better understand the dataset, in this analysis, 19 variables will be selected.
pr <- subset(pr, select = c('LoanStatus',
'BorrowerAPR',
'BorrowerRate',
'LenderYield',
'ProsperScore',
'BorrowerState',
'Occupation',
'EmploymentStatus',
'IsBorrowerHomeowner',
'TotalCreditLinespast7years',
'TotalInquiries',
'BankcardUtilization',
'AvailableBankcardCredit',
'IncomeRange',
'IncomeVerifiable',
'LoanOriginalAmount',
'LoanOriginationDate',
'MonthlyLoanPayment',
'Investors'
))
str(pr)
## 'data.frame': 113937 obs. of 19 variables:
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ TotalCreditLinespast7years: int 12 29 3 29 49 49 20 10 32 32 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
# table(pr$ProsperScore)
ggplot(aes(ProsperScore), data = pr) +
scale_x_continuous(breaks = seq(1, 11, 1) ) +
geom_histogram(stat = 'count', binwidth = 1)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 29084 rows containing non-finite values (stat_count).
# From the histogram, the majority of the ProserScore are 4, 6, and 8
From the histogram, the majority of the ProserScore are 4, 6, and 8
ggplot(aes(LoanStatus), data = pr) +
geom_histogram(stat = 'count') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Most of the loan are current and the second is the completed loans
range_list <- c('$0','$1-24,999','$25,000-49,999','$50,000-74,999','$75,000-99,999',
'$100,000+','Not employed','Not displayed')
pr$income_range <- factor(pr$IncomeRange,levels = range_list)
Create a new variabel income_range to better format the IncomeRange variable
ggplot(aes(pr$EmploymentStatus), data = pr) +
geom_histogram(stat = 'count') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
The loan data mostly coming from people who works employed or full time.
ggplot(aes(pr$income_range), data = pr) +
geom_histogram(stat = 'count') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
People whose income is between $25,000 and $100,000 applied for the loans.
The ProsperLoan data have 113937 overvations and 19 variables ProsperSocre: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. Loanstatus: Completed, Current, Past Due (1-15 days), Defaulted, Chargedoff, Past Due (16-30 days), Cancelled, Past Due (61-90 days), Past Due (31-60 days), Past Due (91-120 days)
EmploymentStatus:Self-employed, Employed, Not available, Full-time,, Other, Not employed, Part-time, Retired
IncomeRange: $25,000-49,999, $50,000-74,999, Not displayed, $100,000+, $75,000-99,999, $1-24,999,Not employed, $0
$0 $1-24,999, $100,000+, $25,000-49,999, $50,000-74,999, $75,000-99,999, Not displayed, Not employed
The main feature is the ProsperSocre, which measue the risk ability of the loan itself, verus the BorrowerRate
The loan applicants’ occupation,income, bankcard utilization, available bank card credit and other variables might impact the rick score when valued by th Prosper Company
Yes, I created the new variable range_new to reorder the income range variable in a ascending order.
Yes. I did select 19 variables out of 81 in total. The reason for this is that not all the variables are revelent in determining the Prosper Score.
ggplot(aes(income_range, ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
For applicants with lower income, the Prosper Score seems lower than 6, and for people with higher income, the ProsperScore seems higher, which means less risk.
ggplot(aes(income_range, ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
From the box plot we can see it more clearly, the median of people of income greater than $50,000 is much higher than people with income less than $50,000
ggplot(aes(EmploymentStatus, ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
If we compare ProsperScore with employment status, clearly, full time employees will have greater ProperScore, and therefore, less risky to lend money to them
ggplot(aes( ProsperScore, BorrowerRate * 100 ), data = subset(pr, !is.na(ProsperScore))) +
geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
stat_smooth(method = 'lm')
Unsperisingly, applicants with higher prosperscore seem to have lower borrow rate.
with(pr, cor.test(pr$ProsperScore, pr$BorrowerAPR))
##
## Pearson's product-moment correlation
##
## data: pr$ProsperScore and pr$BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6719940 -0.6645469
## sample estimates:
## cor
## -0.6682872
ggplot(aes( EmploymentStatus, BorrowerRate * 100 ), data = subset(pr, !is.na(ProsperScore))) +
geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
stat_smooth(method = 'lm')
ggplot(aes(pr$AvailableBankcardCredit/1000, pr$BorrowerRate*100 ),
data = subset(pr, (!is.na(BorrowerRate)))) +
# data = na.omit(pr)) +
geom_jitter(alpha =.5, shape = 21, colour = '#F79420') +
# xlim((quantile(AvailableBankcardCredit, 0.1), quantile(AvailableBankcardCredit, 0.80))) +
scale_x_continuous(limits = c(1, 150),breaks = seq(1, 150, 20) ) +
ylim(5,40)
## Warning: Removed 36065 rows containing missing values (geom_point).
Higer available bank card credit will also have lower borrow rate.
Other than main feature, I also noticed applicants with higher available bank card credit will also have lower borrow rate.
Borrow rate verus the properscore, The coefficient between them is -0.6682872.
ggplot(aes( BorrowerRate,ProsperScore), data = subset(pr, !is.na(ProsperScore)) ) +
geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
# theme(axis.text.x = element_text(angle = 60, hjust = 1))+
facet_wrap(~income_range)
ggplot(aes(BorrowerRate,ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
geom_jitter(alpha = .05, shape = 21, colour = '#099DD9') +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
facet_wrap(~EmploymentStatus)
The same pattern also occur for different employment status
ggplot(aes( ProsperScore,BorrowerRate* 100, colour = income_range), data = pr) +
geom_point(alpha = .5, size = .5, position = 'jitter') +
scale_color_brewer(type = 'div', guide = guide_legend(title = 'income_range',
reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_y_continuous(limits = c(quantile(pr$BorrowerRate * 100, .05), quantile(pr$BorrowerRate * 100, .90)))
## Warning: Removed 42364 rows containing missing values (geom_point).
When comparing with the main relationship ProsperScore and BorrowerRate, we also noticed a liner relationship adding the category variable income range
ggplot(aes(BorrowerRate*100, ProsperScore, colour = EmploymentStatus), data = pr) +
geom_point(alpha = .5, size = .75, position = 'jitter') +
scale_color_brewer(type = 'div', guide = guide_legend(title = 'EmploymentStatus',
reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(limits = c(quantile(pr$BorrowerRate* 100, .05), quantile(pr$BorrowerRate*100, .90)))
## Warning: Removed 42343 rows containing missing values (geom_point).
m1 <- lm(I(BorrowerRate) ~ 0 + I(ProsperScore), data = subset(pr, !is.na(ProsperScore)))
m2 <- update(m1, ~ . + AvailableBankcardCredit)
m3 <- update(m2, ~ . + IncomeRange)
m4 <- update(m3, ~ . + EmploymentStatus)
m5 <- update(m4, ~ . + TotalCreditLinespast7years)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ 0 + I(ProsperScore), data = subset(pr,
## !is.na(ProsperScore)))
## m2: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit -
## 1, data = subset(pr, !is.na(ProsperScore)))
## m3: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit +
## IncomeRange - 1, data = subset(pr, !is.na(ProsperScore)))
## m4: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit +
## IncomeRange + EmploymentStatus - 1, data = subset(pr, !is.na(ProsperScore)))
## m5: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit +
## IncomeRange + EmploymentStatus + TotalCreditLinespast7years -
## 1, data = subset(pr, !is.na(ProsperScore)))
##
## ============================================================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------------------------------------
## I(ProsperScore) 0.026*** 0.028*** -0.018*** -0.019*** -0.019***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## AvailableBankcardCredit -0.000*** -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000)
## IncomeRange: $0 0.352*** 0.348*** 0.348***
## (0.008) (0.008) (0.008)
## IncomeRange: $1-24,999 0.334*** 0.336*** 0.336***
## (0.001) (0.001) (0.001)
## IncomeRange: $100,000+ 0.304*** 0.307*** 0.307***
## (0.001) (0.001) (0.001)
## IncomeRange: $25,000-49,999 0.317*** 0.319*** 0.318***
## (0.001) (0.001) (0.001)
## IncomeRange: $50,000-74,999 0.308*** 0.310*** 0.309***
## (0.001) (0.001) (0.001)
## IncomeRange: $75,000-99,999 0.306*** 0.309*** 0.308***
## (0.001) (0.001) (0.001)
## IncomeRange: Not employed 0.364*** 0.368*** 0.367***
## (0.002) (0.002) (0.002)
## EmploymentStatus: Full-time/Employed 0.025*** 0.025***
## (0.001) (0.001)
## EmploymentStatus: Other/Employed -0.003** -0.003**
## (0.001) (0.001)
## EmploymentStatus: Part-time/Employed 0.016*** 0.016***
## (0.003) (0.003)
## EmploymentStatus: Retired/Employed 0.020*** 0.020***
## (0.003) (0.003)
## EmploymentStatus: Self-employed/Employed -0.009*** -0.009***
## (0.001) (0.001)
## TotalCreditLinespast7years 0.000
## (0.000)
## ----------------------------------------------------------------------------------------------------------------------------
## R-squared 0.612 0.621 0.932 0.933 0.933
## adj. R-squared 0.612 0.621 0.932 0.933 0.933
## sigma 0.131 0.129 0.055 0.054 0.054
## F 133697.431 69606.628 128923.549 84690.770 79044.868
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 52266.161 53323.957 126092.434 126950.145 126950.686
## Deviance 1449.358 1413.669 254.367 249.276 249.273
## AIC -104528.323 -106641.915 -252164.869 -253870.290 -253869.372
## BIC -104509.626 -106613.869 -252071.382 -253730.059 -253719.793
## N 84853 84853 84853 84853 84853
## ============================================================================================================================
From multivariate analysis, the borrower’s rate is also affacted by other variables such as income range, employment status, aviable bank credits.
Holding the borrower’s rate constant, the employed status will have lower prosper score.
Yes, I did. For the linear models I created, R square is about 93%, which means almost 93% percent of the variation can be explained by the model. I also excluded the intercept, which strenthed the linear relationship between the variables.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 29084 rows containing non-finite values (stat_count).
The loans were valued by the risk levels which bing called as ProsperScore, and the greater the score the lower the risk. Histogram showed us the counts for different ProsperScore.Majority of the loans have the score between 4 and 9.
Ggplot gives us the negative correlation between ProsperScore and the borrower’s rate, the higher the score seems to lead to a lower borrower’s rate.
## Warning: Removed 42348 rows containing missing values (geom_point).
When considering other variables, for example, income range, will also help us in predicting the borrower’s rate.
Prosper Loan dataset has thorogh loan data regrading theri unique attributes, and when evaluating the loan applications, these variables could benefit the company in deciding the accurate rate. From the the analysis, it is shown that borrower’s rate the highly correlated with borrower’s ProsperScore, which measured the risk of the applicant. We also learned that other factors such as income level, employment status, avaible bank credits could also affect borrower’s rate. More thorough demographic data can be included in the dataset, therefore, we can better the detailed attributs of the applicants.